Load the packages and libraries needed for this lab
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Read in the global time series data and pivot to long format for the dates
time_series_confirmed_long <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")) %>%
rename(Province_State = "Province/State", Country_Region = "Country/Region") %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = "Date", values_to = "Confirmed")
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
# Now do the same for the death data over time
time_series_deaths_long <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv")) %>%
rename(Province_State = "Province/State", Country_Region = "Country/Region") %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long),
names_to = 'Date', values_to = 'Deaths')
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
# Create the Keys for joining data
time_series_confirmed_long <- time_series_confirmed_long %>%
unite(Key, Province_State, Country_Region, Date, sep = '.', remove = FALSE)
time_series_deaths_long <- time_series_deaths_long %>%
unite(Key, Province_State, Country_Region, Date, sep = '.') %>%
select(Key, Deaths)
# Now full_join the tables
time_series_long_joined <- full_join(time_series_confirmed_long,
time_series_deaths_long, by = c("Key")) %>%
select(-Key)
# Reformat the data to make the date values Date objects
time_series_long_joined$Date <- mdy(time_series_long_joined$Date)
# Create a Report table with counts like before to graph both Confirmed and Deaths
time_series_long_joined_counts <- time_series_long_joined %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long, Date),
names_to = "Report_Type", values_to = "Counts")
You can control the graph size of RMarkdown files in the code chunk {r} using various arguments, such as eval = FALSE that will suppress the code execution but will still show the code. Alternatively, code = FALSE will exectute the code but not show the code. You can also suppress error messages and warnings so the report won’t include them (although you should still be aware that they exist). To edit the dimensions of an individual graph, specify the dimensions of the graph using fig.width and fig.height
You can save a plot/figure using a wide variety of different formats. This is good for controlling the final dimensions in a reproducible way for manuscripts and such.
# Plot a graph to a pdf output file
pdf('images/time_series_example_plot.pdf', width = 6, height = 3)
time_series_long_joined %>%
group_by(Country_Region, Date) %>%
summarize_at(c("Confirmed", "Deaths"), sum) %>% # Recall that summarize_at summarizes for multiple variables
filter(Country_Region == "US") %>%
ggplot(aes(x = Date, y = Deaths)) +
geom_point() +
geom_line() +
ggtitle("US COVID-19 Deaths")
dev.off() # Turn off the input to the PDF file
## png
## 2
# Now plot the graph to a png output file, same process
ppi <- 300
png('images/time_series_example_plot.png', width = 6*ppi, height = 6*ppi, res = ppi)
time_series_long_joined %>%
group_by(Country_Region, Date) %>%
summarize_at(c("Confirmed", "Deaths"), sum) %>%
filter(Country_Region == "US") %>%
ggplot(aes(x = Date, y = Deaths)) +
geom_point() +
geom_line() +
ggtitle("US COVID-19 Deaths")
dev.off()
## png
## 2
# This created a much larger plot with much higher resolution
Loading images into an Rmd file works with png files, but not pdf files. Other images can also load images made from other bioinformatic tools into an Rmd report.
This is the RMarkdown style for inserting images Your image must be in your working directory to work This command must be put OUTSIDE of the r code chunk
US COVID-19 Deaths
This is the alternative way for using html - You still must have the image in the working directory or specify the path The html is put OUTSIDE of the r code chunk, like for RMarkdown
Finally, a third way is to use R code but specify echo = FALSE in the r chunk to allow the code output to be displayed but not the code itself
plotly/ggplotly allow interactive graphs to be made for lab reports. The two common formats for making graphs is to (1) display the graph directly by calling the function or (2) point the plot to a variable and call the variable. An example of (1), where the entire graph production is done at once -
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(
time_series_long_joined %>%
group_by(Country_Region, Date) %>%
summarize_at(c("Confirmed", "Deaths"), sum) %>%
filter(Country_Region == "US") %>%
ggplot(aes(x = Date, y = Deaths)) +
geom_point() +
geom_line() +
ggtitle("US COVID-19 Deaths")
)
ggplotly converts a ggplot() object to a plotly object to create an interactive graph. There are additional arguments beside the ggplot object that can be used to further customize the created interactive graph - see documentation for details.
An example of (2), where the graph production is separated into variables and then called -
# Subset te time series data to include the US deaths
US_deaths <- time_series_long_joined %>%
group_by(Country_Region, Date) %>%
summarize(Deaths = sum(Deaths)) %>%
filter(Country_Region == "US")
## `summarise()` regrouping output by 'Country_Region' (override with `.groups` argument)
# Collect the layers for a graph of the US time series data for COVID deaths
p <- ggplot(data = US_deaths, aes(x = Date, y = Deaths)) +
geom_point() +
geom_line() +
ggtitle("US COVID-19 Deaths")
# Now plot the graph using ggplotly
ggplotly(p) # Same interactive graph
Animated gganimate graphs are essentially a bunch of files to create a gif, which can be used for a report or even saved for use in a slide presentation or something else. Some important gganimate functions are - transition_() functions define how the data should be spread out and transitions over time. view_() functions define how the positional scales should change with the animation. shadow_() functions define how data from other points in time should appear at a given point in time during the animation. enter_() and exit_*() functions define how new data should appear and old data disappear in the animation, respectively. ease_aes() defines how different aesthetics should be eased (moved) during transitions
# Creating animations
library(gganimate)
library(transformr)
library(gifski)
theme_set(theme_bw())
# Create an animation of the confirmed cases in select countries
data_time <- time_series_long_joined %>%
group_by(Country_Region, Date) %>%
summarize_at(c('Confirmed', 'Deaths'), sum) %>%
filter(Country_Region %in% c("China","Korea, South","Japan","Italy","US"))
p <- ggplot(data_time, aes(x = Date, y = Confirmed, color = Country_Region)) +
geom_point() +
geom_line() +
ggtitle("Confirmed COVID-19 Cases") + # Normal ggplot up to this point
geom_point(aes(group = seq_along(Date))) + # This keeps the points on the graphs by grouping all prior points
transition_reveal(Date) # Reveal the data over the x-axis variable (Date)
animate(p, renderer = gifski_renderer(), end_pause = 15) # Creates the animation
# The output can be saved to a gif to be used elsewhere (like slide presentations)
# by using anim_save() function. Here is the same animation saved as a gif
anim_save('images/confirmed_5_countries.gif', renderer = gifski_renderer(), p)
Print a graph (different from the above ones) to a png file using 3*ppi for the height and width, then display the png file in the report using RMarkdown format
Let’s make a graph showing the number of confirmed cases in MA over time -
time_series_confirmed_long_MA_OK <- read_csv(url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")) %>%
filter(Province_State == c("Massachusetts", 'Oklahoma')) %>% # States were I and my family live
select(-c('UID', 'iso2', 'iso3', 'FIPS', "code3", 'Admin2')) %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long_, Combined_Key), names_to = "Date", values_to = "Confirmed") %>%
select(c(Province_State, Date, Confirmed)) %>%
group_by(Province_State, Date) %>%
summarize(Confirmed = sum(Confirmed))
## Parsed with column specification:
## cols(
## .default = col_double(),
## iso2 = col_character(),
## iso3 = col_character(),
## Admin2 = col_character(),
## Province_State = col_character(),
## Country_Region = col_character(),
## Combined_Key = col_character()
## )
## See spec(...) for full column specifications.
## `summarise()` regrouping output by 'Province_State' (override with `.groups` argument)
head(time_series_confirmed_long_MA_OK) # Table was created correctly
## # A tibble: 6 x 3
## # Groups: Province_State [1]
## Province_State Date Confirmed
## <chr> <chr> <dbl>
## 1 Massachusetts 1/22/20 0
## 2 Massachusetts 1/23/20 0
## 3 Massachusetts 1/24/20 0
## 4 Massachusetts 1/25/20 0
## 5 Massachusetts 1/26/20 0
## 6 Massachusetts 1/27/20 0
time_series_confirmed_long_MA_OK$Date <- mdy(time_series_confirmed_long_MA_OK$Date) # Convert Date to Date objects
# Now plot -
plot_MA_OK_Confirmed <- ggplot(time_series_confirmed_long_MA_OK, aes(x = Date, y = Confirmed, color = Province_State)) +
geom_point() +
geom_line() +
scale_x_date(limits = as.Date(c('2020-01-01', '2020-10-01'))) +
labs(y = "Confirmed Cases", title = "Confirmed COVID-19 Cases (MA, OK)") +
scale_color_discrete(name = 'State') +
theme_classic() +
theme(axis.title = element_text(face = 'bold', size = 8),
plot.title = element_text(face = 'bold', hjust = 0.5, size = 10),
legend.title = element_text(face = 'bold', hjust = 0.5, size = 8),
legend.text = element_text(size = 6),
legend.position = 'bottom')
# Now save the graph to a png file with the specifications -
ppi <- 800
png('images/MA_OK_Confirmed.png', width = 3*ppi, height = 3*ppi, res = ppi)
plot_MA_OK_Confirmed
dev.off()
## png
## 2
Now display the graph via RMarkdown syntax -
Turn an exercise plot from Lab 5 into an interactive graph using plotyly
Take the graph from Exercise 7, which plots the deaths from COVID-19 over time for the countries with the top 10 deaths -
# Format the time series data
time_series_deaths <- read_csv("data/time_series_covid19_deaths_global.csv")%>%
rename(Province_State = "Province/State", Country_Region = "Country/Region")
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Province/State` = col_character(),
## `Country/Region` = col_character()
## )
## See spec(...) for full column specifications.
# Determine the countries with the highest death totals
deaths_top10 <- time_series_deaths %>%
select(-c(Lat, Long)) %>% # Remove the extraneous variables
group_by(Country_Region) %>%
summarize(Total_Latest_Deaths = sum(`9/25/20`)) %>%
arrange(desc(Total_Latest_Deaths)) %>%
slice(0:10)
## `summarise()` ungrouping output (override with `.groups` argument)
# Use this data set to filter the time series data
time_series_deaths_top10 <- time_series_deaths %>%
select(-c(Lat, Long)) %>% # Remove the extraneous variables
filter(Country_Region %in% deaths_top10$Country_Region) %>% # Filter for top 10 countries by deaths
pivot_longer(-c(Province_State, Country_Region), names_to = "Date", values_to = "Deaths") %>%
group_by(Country_Region, Date) %>%
summarize(Total_Deaths = sum(Deaths))
## `summarise()` regrouping output by 'Country_Region' (override with `.groups` argument)
mdy(time_series_deaths_top10$Date) -> time_series_deaths_top10$Date
# Assign the plot to a variable and then use plotyly to create interactive graph
plot_top10_deaths_worldwide <- ggplot(time_series_deaths_top10, aes(x = Date, y = Total_Deaths, color = Country_Region)) +
geom_point() +
geom_line() +
labs(y = "Deaths", title = "COVID-19 Deaths by Country in the Top 10 Countries") +
scale_x_date(limits = as.Date(c('2020-01-01', '2020-10-01'))) +
scale_color_discrete(name = "Country") +
theme(plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
axis.title = element_text(size = 14, face = "bold"),
legend.title = element_text(face = 'bold'))
ggplotly(plot_top10_deaths_worldwide)
Create an animated graph of my choice for the time data to display an aspect that is important to me. Choosing to graph the total confirmed cases in the states with the top 10 number of confirmed cases as of today (29 September 2020) -
# Download and format the data -
download.file(url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv",
destfile = 'data/time_series_covid19_confirmed_US.csv')
top10_states_confirmed <- read_csv('data/time_series_covid19_confirmed_US.csv') %>%
group_by(Province_State) %>%
summarize(Confirmed=sum(`9/29/20`)) %>%
arrange(desc(Confirmed)) %>%
slice(1:10)
## Parsed with column specification:
## cols(
## .default = col_double(),
## iso2 = col_character(),
## iso3 = col_character(),
## Admin2 = col_character(),
## Province_State = col_character(),
## Country_Region = col_character(),
## Combined_Key = col_character()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
top10_states_confirmed
## # A tibble: 10 x 2
## Province_State Confirmed
## <chr> <dbl>
## 1 California 815654
## 2 Texas 767264
## 3 Florida 704568
## 4 New York 457649
## 5 Georgia 316306
## 6 Illinois 293448
## 7 Arizona 218184
## 8 North Carolina 209136
## 9 New Jersey 204563
## 10 Tennessee 194611
# Now use the top 10 states to parse the time series data
time_series_confirmed_top10_states <- read_csv('data/time_series_covid19_confirmed_US.csv') %>%
filter(Province_State %in% top10_states_confirmed$Province_State) %>%
select(-c('UID', 'iso2', 'iso3', 'FIPS', "code3", 'Admin2')) %>%
pivot_longer(-c(Province_State, Country_Region, Lat, Long_, Combined_Key), names_to = "Date", values_to = "Confirmed") %>%
select(c(Province_State, Date, Confirmed)) %>%
group_by(Province_State, Date) %>%
summarize(Confirmed = sum(Confirmed))
## Parsed with column specification:
## cols(
## .default = col_double(),
## iso2 = col_character(),
## iso3 = col_character(),
## Admin2 = col_character(),
## Province_State = col_character(),
## Country_Region = col_character(),
## Combined_Key = col_character()
## )
## See spec(...) for full column specifications.
## `summarise()` regrouping output by 'Province_State' (override with `.groups` argument)
time_series_confirmed_top10_states$Date <- mdy(time_series_confirmed_top10_states$Date)
head(time_series_confirmed_top10_states)
## # A tibble: 6 x 3
## # Groups: Province_State [1]
## Province_State Date Confirmed
## <chr> <date> <dbl>
## 1 Arizona 2020-01-22 0
## 2 Arizona 2020-01-23 0
## 3 Arizona 2020-01-24 0
## 4 Arizona 2020-01-25 0
## 5 Arizona 2020-01-26 1
## 6 Arizona 2020-01-27 1
# Now create the plot -
plot_top10_confirmed_states <- ggplot(time_series_confirmed_top10_states, aes(x = Date, y = Confirmed, color = Province_State)) +
geom_line() +
scale_x_date(limits = as.Date(c('2020-01-01', '2020-10-01'))) +
labs(y = "Confirmed Cases", title = "Confirmed COVID-19 Cases in Top 10 US States") +
scale_color_discrete(name = 'State') +
theme_classic() +
theme(axis.title = element_text(face = 'bold', size = 14),
plot.title = element_text(face = 'bold', hjust = 0.5, size = 16),
legend.title = element_text(face = 'bold', hjust = 0.5, size = 14),
legend.text = element_text(size = 10)) +
transition_reveal(Date) +
shadow_trail()
animate(plot_top10_confirmed_states, renderer = gifski_renderer(), end_pause = 20)
From this, you can see how the states with the highest number of confirmed cases have only recently taken the lead due to a massive spike in the number of cases over the summer (California, Texas, Florida). You can also see how the cases continue to go up, especially in the states that have recently achieved the highest number of confirmed cases. You can also see how some of the states with the earliest high numbers have plateaued (New York, New Jersey) while other states are quickly approaching these plateaued states’ numbers (Georgia, North Carolina, Tennessee, Illinois, Arizona). Also note that Massachusetts is no longer in the top 10.